Analysis of Crime within the District of Columbia
Introduction
This data set contains information on 153,687 felony arrests made in the District of Columbia between 2013 and 2017. The data are publicly available on the website of the District of Columbia’s police department. According to the documentation attached to the data set, a felony arrest is defined as the police taking into custody a person suspected of having committed a crime. The data set records arrests only; it contains no information about whether an arrest led to a conviction. As a result, the modeling in this report assumes a 100% conviction rate for all arrests. This is unlikely to hold in practice, so some deviation from the model can be expected.
The crimes recorded in this data set include boating violations, disorderly conduct, arson, homicide, and many more, ranging in severity from felonies to misdemeanors. Of the arrests, 34,133 were of females, 119,373 were of males, and 181 were of persons whose sex was recorded as unknown. Because males make up about 78% of the records, the model built here may be more applicable to the male population. The mean age of those arrested was 34.8 years.
For this report, we sought to answer the question: “Is it possible to accurately profile who is likely to commit a crime in DC?” In the last few years, the use of racial profiling by law enforcement has been a controversial topic. Historically, police officers have had a large presence in primarily African American and Latino communities under the assumption that these areas are plagued with crime. As a result, there have been many incidents of unlawful arrest and police brutality.
Structure of the Data Set
The data set originally contained 28 variables. ObjectID was removed because it corresponded only to the row number. Similarly, CCN and Arrest_Number were removed because these values corresponded to specifics about the arrest that were encrypted due to privacy concerns. The variables remaining after cleaning are shown below, followed by the number of missing values in each.
## 'data.frame': 153687 obs. of 25 variables:
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ YEAR : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ MONTH : int 1 1 1 1 2 2 2 2 3 3 ...
## $ DAY : int 5 9 9 28 7 9 23 27 8 14 ...
## $ HOUR : int 11 15 17 10 7 4 18 6 15 9 ...
## $ AGE : int 34 23 44 23 22 50 30 23 27 27 ...
## $ DEFENDANT_PSA : int 0 403 0 505 103 0 602 505 603 607 ...
## $ DEFENANT_ISTRICT : int 0 4 0 5 1 0 6 5 6 6 ...
## $ Race : Factor w/ 4 levels "1","2","3","4": 1 2 3 3 3 3 3 3 3 3 ...
## $ ETHNICITY : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 NA 1 1 1 ...
## $ SEX : Factor w/ 3 levels "1","2","3": 1 2 2 2 2 2 2 2 3 2 ...
## $ CATEGORY : Factor w/ 29 levels "1","2","3","4",..: 2 6 22 11 22 1 1 1 6 2 ...
## $ DESCRIPTION : Factor w/ 997 levels "1","2","3","4",..: 664 816 737 910 737 883 904 904 751 429 ...
## $ ARREST_PSA : int 702 403 703 504 103 604 601 503 502 405 ...
## $ ARREST_DISTRICT : int 7 4 7 5 1 6 6 5 5 4 ...
## $ ARREST_BLOCKX : int 402600 398200 400700 400600 399000 406000 404200 402300 400000 400700 ...
## $ ARREST_BLOCKY : int 131700 143300 132800 139500 137300 135100 137500 138600 139300 142400 ...
## $ OFFENSE_BLOCKY : int NA NA NA NA NA NA NA NA NA NA ...
## $ OFFENSE_BLOCKX : int NA NA NA NA NA NA NA NA NA NA ...
## $ OFFENSE_PSA : int NA NA 706 NA 101 NA NA NA NA NA ...
## $ OFFENSE_DISTRICT : int NA NA 7 NA 1 NA NA NA NA NA ...
## $ ARREST_LATITUDE : num NA NA NA NA NA NA NA NA NA NA ...
## $ ARREST_LONGITUDE : num NA NA NA NA NA NA NA NA NA NA ...
## $ OFFENSE_LATITUDE : num NA NA NA NA NA NA NA NA NA NA ...
## $ OFFENSE_LONGITUDE: num NA NA NA NA NA NA NA NA NA NA ...
##          OBJECTID              YEAR             MONTH               DAY
##                 0                 0                 0                 0
##              HOUR               AGE     DEFENDANT_PSA  DEFENANT_ISTRICT
##                 0                 0                 0                 0
##              Race         ETHNICITY               SEX          CATEGORY
##                 0             45899                 0                 0
##       DESCRIPTION        ARREST_PSA   ARREST_DISTRICT     ARREST_BLOCKX
##                 0               813               813              1773
##     ARREST_BLOCKY    OFFENSE_BLOCKY    OFFENSE_BLOCKX       OFFENSE_PSA
##              1773               746               746               535
##  OFFENSE_DISTRICT   ARREST_LATITUDE  ARREST_LONGITUDE  OFFENSE_LATITUDE
##               527              2274              2274               746
## OFFENSE_LONGITUDE
##               746
Of the 25 variables in the cleaned data set, 12 had no missing values. Another 12 had between 527 and 2,274 missing values each; relative to the 153,687 records in the file, that is less than 1.5% per variable. Only one variable (ETHNICITY) had a large number of missing values (45,899, or about 30% of the records) and, for this reason, was not used in the statistical analyses.
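As a minimal sketch of how per-variable missing-value counts like those above can be tabulated, the following uses a small toy data frame. The name arrests_toy and all of its values are invented for illustration; only the column names are borrowed from the data set.

```r
# Toy data frame standing in for the arrest data (all values are made up).
arrests_toy <- data.frame(
  AGE        = c(34, 23, 44, 23, 22),
  ETHNICITY  = factor(c("1", NA, "1", NA, "2")),
  ARREST_PSA = c(702, 403, NA, 504, 103)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(arrests_toy))

# Share of rows with a missing value in each column, in percent.
na_share <- round(100 * na_counts / nrow(arrests_toy), 1)

print(data.frame(missing = na_counts, percent = na_share))
```

Reporting the share alongside the raw count makes it easier to judge whether a variable can reasonably be kept.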
Correlation of Age and Type of Crime
##
## Call:
## glm(formula = SEX ~ CATEGORY, family = "binomial", data = sex_category)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.964 0.041 0.047 0.050 0.292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.9580 0.2501 27.82 < 2e-16 ***
## CATEGORY2 0.2037 0.3485 0.58 0.55885
## CATEGORY3 -0.8807 0.4792 -1.84 0.06607 .
## CATEGORY4 0.8976 1.0310 0.87 0.38398
## CATEGORY5 -0.2560 0.3257 -0.79 0.43187
## CATEGORY6 -0.2293 0.3919 -0.59 0.55839
## CATEGORY7 0.1863 0.5126 0.36 0.71629
## CATEGORY8 -0.3466 0.5594 -0.62 0.53550
## CATEGORY9 -0.3305 0.5127 -0.64 0.51919
## CATEGORY10 -0.1405 0.3048 -0.46 0.64488
## CATEGORY11 -0.1645 0.4332 -0.38 0.70416
## CATEGORY12 -0.7519 0.3822 -1.97 0.04914 *
## CATEGORY13 -0.8496 0.4334 -1.96 0.04997 *
## CATEGORY14 -0.5889 0.7506 -0.78 0.43269
## CATEGORY15 -0.7485 0.4535 -1.65 0.09887 .
## CATEGORY16 0.2839 0.5592 0.51 0.61165
## CATEGORY17 -1.2500 0.5598 -2.23 0.02555 *
## CATEGORY18 -1.2475 0.5598 -2.23 0.02585 *
## CATEGORY19 -0.4069 1.0315 -0.39 0.69324
## CATEGORY20 -0.9035 1.0319 -0.88 0.38127
## CATEGORY21 12.6081 1007.2056 0.01 0.99001
## CATEGORY22 0.0304 1.0313 0.03 0.97645
## CATEGORY23 0.1161 1.0312 0.11 0.91033
## CATEGORY24 -0.0887 0.5593 -0.16 0.87399
## CATEGORY25 12.6081 512.0959 0.02 0.98036
## CATEGORY26 12.6081 533.0567 0.02 0.98113
## CATEGORY27 -1.6850 1.0333 -1.63 0.10296
## CATEGORY28 -0.4567 1.0315 -0.44 0.65797
## CATEGORY29 -3.8225 1.0517 -3.63 0.00028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2803.2 on 153686 degrees of freedom
## Residual deviance: 2769.1 on 153658 degrees of freedom
## AIC: 2827
##
## Number of Fisher Scoring iterations: 18
The results of the logit analysis show that only 5 of the 28 category coefficients are statistically significant at the 0.05 level. These categories are liquor law violations, damage to property, burglary, offenses against family & children, and arson. Of these five, category 29 has the lowest p-value (0.00028), suggesting a strong association between the sex of the offender and the probability of committing arson.
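The following is a hedged sketch of this kind of analysis on synthetic data, not a reproduction of the model above. The variable names (toy, sex, category) and the simulated probabilities are assumptions made for illustration. It fits a binomial glm() and keeps the category terms with p < 0.05.

```r
set.seed(1)
n <- 2000

# Synthetic data: three offense categories, with the probability of a male
# arrestee differing by category (all values are invented).
category <- factor(sample(c("1", "2", "3"), n, replace = TRUE))
p_male   <- c("1" = 0.80, "2" = 0.80, "3" = 0.50)[as.character(category)]
sex      <- factor(ifelse(runif(n) < p_male, "male", "female"),
                   levels = c("female", "male"))
toy <- data.frame(sex, category)

# Logit model: log-odds of "male" (the second factor level) by category.
fit <- glm(sex ~ category, family = binomial, data = toy)

# Coefficient table, restricted to category terms with p < 0.05.
coefs    <- coef(summary(fit))
cat_rows <- grep("^category", rownames(coefs))
sig      <- coefs[cat_rows, , drop = FALSE]
sig      <- sig[sig[, "Pr(>|z|)"] < 0.05, , drop = FALSE]
print(round(sig, 4))

# Odds ratios relative to the reference category "1".
print(round(exp(coef(fit)), 3))
```

Each coefficient is a log-odds difference from the reference category, so exponentiating gives an odds ratio that is easier to interpret than the raw estimate.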
The histogram and density lines indicate that the age at which crimes are committed in DC follows a bimodal distribution. The largest number of crimes is committed by 18-year-olds. The number of crimes falls steeply for ages between 25 and 42. However, this trend is not sustained: the number of crimes committed by people of about 50 years of age rises again and then decreases rapidly.
The box-and-whisker plot is useful for understanding the distribution of the data and for visualizing outliers. Here we can see that all but one category (vending violations) are right-skewed. In a positively skewed distribution, the mean and median are shifted to higher values relative to the mode. In the plot above, the right skew is most likely caused by the outliers.
Several observations can be drawn from this box-and-whisker plot. For instance, robbery, weapon violations, homicide, and gambling were mostly committed by younger people (late 20’s to early 30’s). The opposite holds for vending violations, which were mostly committed by people in their 40’s and 50’s.
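The skew and outlier observations above can be checked numerically as well as visually. The sketch below is an illustration on synthetic ages for two invented categories, "A" and "B". It compares the mean with the median per category and counts the points that fall beyond the whiskers, using boxplot.stats() from base R.

```r
set.seed(2)

# Synthetic ages for two invented categories.
toy <- data.frame(
  category = rep(c("A", "B"), each = 500),
  age = c(18 + rexp(500, rate = 1 / 12),    # long right tail
          rnorm(500, mean = 45, sd = 8))    # roughly symmetric
)

# Per-category mean, median, and number of points beyond the whiskers.
summarise_age <- function(x) {
  c(mean = mean(x),
    median = median(x),
    n_outliers = length(boxplot.stats(x)$out))
}
by_cat <- t(sapply(split(toy$age, toy$category), summarise_age))
print(round(by_cat, 1))
```

A mean noticeably above the median, as in category "A" here, is a quick indicator of right skew.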
Violent Crimes Committed in Foggy Bottom/Greater DC
The map above shows the crimes committed between 2013 and 2017. The map is zoomed in on the Foggy Bottom, Dupont Circle, Georgetown, and downtown areas; the red, green, and blue circles correspond to robbery, sex abuse, and homicide, respectively. Foggy Bottom and downtown are well known for being relatively safe. This map shows where crimes were committed in the area investigated. We can note that robbery is fairly common in different areas of DC, including the GWU premises. However, only a few sex abuse and homicide cases were reported in that area.
Sadly, the zoomed-out map shows a different trend in DC’s other neighborhoods. Sex abuse and homicide increase tremendously outside the aforementioned areas. There is an extreme rise in the number of homicides and sex abuse cases in areas such as Anacostia, Takoma Park, and the Brentwood vicinity. However, the number of sex abuse cases is notably higher than the number of homicides in the Southwest region.
Do people commit crime in their own neighborhood?
To further address our SMART question, we examined whether people are more likely to commit a felony offense in their own neighborhood. Based on our findings, 37,533 people, or 24%, of those who committed a crime in DC did so in their own neighborhood, while 76% of crimes were committed by residents of DC in a different neighborhood or by people from out of state. To further understand these findings, we also sought to determine whether the crimes committed closer to home were violent offenses, white-collar crimes, or misdemeanors. For our purposes, we have defined violent crimes as aggravated assault, assault on a police officer, assault with a dangerous weapon, sex offenses, kidnapping, sex abuse, homicide, weapon violations, and arson.
Interestingly, we found that 24.9% of the arrests made between 2013 and 2017 were of people who committed the crime within their own neighborhood. Of these arrests, 51.3% were for a violent crime. In comparison, 26.5% of the arrests documented here were of people who lived outside DC, while the remaining portion were of DC residents who committed a crime in a different neighborhood. Of the crimes for which out-of-staters were arrested, only 23.6% were violent. This could indicate that people are more likely to commit a violent crime within their own neighborhood. Similarly, the age spread for violent crimes committed by DC residents in their own neighborhoods tended to be larger than for those committed in a different neighborhood.
To better understand the relationship between age and the type of crime committed, we created a density plot, shown below. Based on this density plot, crimes such as narcotics, arson, sex offenses, and gambling are more likely to be committed by younger people. Felonies involving offenses against family and children, liquor laws, and damage to property occur later in life. If this trend is true, younger people should be profiled more by police as having potentially committed a violent crime.
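One way such shares could be computed is sketched below on invented records. Two assumptions are made that the report does not confirm: that "own neighborhood" means the defendant's PSA equals the arrest PSA, and that a defendant PSA of 0 marks a home address outside DC. The list of violent categories follows the definition given above; the spelling of the category labels is assumed.

```r
# Toy records; PSA = police service area. All values are invented.
toy <- data.frame(
  DEFENDANT_PSA = c(403, 0, 505, 103, 602, 0, 603, 505),
  ARREST_PSA    = c(403, 702, 504, 103, 601, 604, 603, 503),
  CATEGORY      = c("Homicide", "Narcotics", "Robbery", "Arson",
                    "Narcotics", "Theft", "Sex Abuse", "Gambling")
)
violent <- c("Aggravated Assault", "Assault on a Police Officer",
             "Assault with a Dangerous Weapon", "Sex Offenses", "Kidnapping",
             "Sex Abuse", "Homicide", "Weapon Violations", "Arson")

# Assumption: PSA code 0 marks a home address outside DC.
# Assumption: matching PSAs mean the arrest was in the home neighborhood.
toy$residence <- ifelse(toy$DEFENDANT_PSA == 0, "outside DC",
                 ifelse(toy$DEFENDANT_PSA == toy$ARREST_PSA,
                        "own neighborhood", "other DC neighborhood"))
toy$is_violent <- toy$CATEGORY %in% violent

# Share of arrests in each residence group (percent).
share <- round(100 * prop.table(table(toy$residence)), 1)

# Share of violent arrests within each residence group (percent).
violent_share <- round(100 * tapply(toy$is_violent, toy$residence, mean), 1)

print(share)
print(violent_share)
```

Comparing the violent share within each group, rather than across the whole data set, is what allows the groups to be contrasted even though they differ in size.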
Building a Multivariate Model
Here, we sought to determine whether police can accurately predict the type of crime a person is likely to commit based on variables present in this data set, such as age, race, and sex. First, we selected the features of the model using the Bayesian information criterion (BIC). In analyzing the BIC plot shown below, we looked for the model with the fewest predictors and the lowest BIC score. Based on this, the model built here uses the following predictors: hour, age, defendant PSA, race, sex, arrest PSA, arrest district, offense PSA, and offense district.
To develop the multivariate model, we split the data set into a training set and a test set. The training set contained 67% of the observations, while the test set contained the remaining 33%. Observations were assigned at random, and the data were centered and scaled.
The multivariate model was produced using the lda() function from the MASS package. Linear discriminant analysis (LDA) seeks a linear combination of features that can be used to separate two or more classes. Essentially, this model attempts to recognize a pattern among the predictor variables in order to predict the crime committed. The coefficients of linear discriminants of this model are displayed below. For each LD, the coefficients are multiplied by the corresponding predictor values and summed to give a score for that observation, which can then be used to compute the posterior probability of class membership.
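The report does not state which function produced the BIC plot, so the following is only a sketch of the general idea: fit several candidate models and prefer the one with the lowest BIC. It uses base R's lm() and BIC() on synthetic data; the variables x1, x2, x3, and y are invented.

```r
set.seed(3)
n <- 500

# Synthetic data: y depends on x1 and x2 only; x3 is pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

# Candidate models of increasing size.
candidates <- list(
  m1 = lm(y ~ x1, data = toy),
  m2 = lm(y ~ x1 + x2, data = toy),
  m3 = lm(y ~ x1 + x2 + x3, data = toy)
)

# BIC = -2 * logLik + log(n) * (number of estimated parameters),
# so each extra predictor must improve the fit enough to pay its penalty.
bic <- sapply(candidates, BIC)
print(round(bic, 1))

best <- names(which.min(bic))
print(best)
```

Because the penalty grows with log(n), BIC tends to favor smaller models than AIC on a data set of this size.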
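A random 67/33 split with centering and scaling might look like the sketch below. The toy data are invented, and one design choice here is an assumption rather than something the report states: the centering and scaling statistics are taken from the training set only and then applied to the test set, so that no information from the test set leaks into the model.

```r
set.seed(4)

# Toy numeric predictors (invented values).
toy <- data.frame(HOUR = sample(0:23, 300, replace = TRUE),
                  AGE  = sample(18:70, 300, replace = TRUE))

# Randomly assign 67% of the rows to the training set.
n_train   <- round(0.67 * nrow(toy))
train_idx <- sample(nrow(toy), n_train)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Center and scale using training-set statistics, then apply the same
# shift and scale to the test set.
train_scaled <- scale(train)
test_scaled  <- scale(test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

print(dim(train_scaled))
print(dim(test_scaled))
```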
##                       LD1     LD2      LD3      LD4      LD5      LD6      LD7
## Hour             9.92e-01  2.0004 6.39e-01 6.77e-01 6.59e-01 9.86e-01 7.85e-01
## Age              1.20e+00  0.7034 5.82e-01 5.90e-01 1.63e+00 9.75e-01 1.34e+00
## Defendant_PSA    1.19e+00  1.2870 1.68e+00 6.65e-01 1.83e+00 1.03e+00 5.79e-01
## Race             1.23e+00  1.4889 1.34e+00 9.90e-01 9.39e-01 1.39e+00 2.31e+00
## Sex              9.88e-01  0.6062 1.36e+00 5.46e-01 5.70e-01 1.12e+00 9.62e-01
## Offense_PSA      1.32e-11  0.0171 1.46e-06 2.95e+02 1.40e+04 1.44e+37 1.65e-08
## Offense_District 2.66e+10 53.7871 6.29e+05 3.27e-03 7.50e-05 5.39e-38 7.95e+07
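To illustrate how lda() produces a coefficient table of this kind and how posterior probabilities follow from it, here is a minimal sketch on synthetic three-class data. The data frame toy and its columns are invented; this is not the model fitted in this report.

```r
library(MASS)  # provides lda()

set.seed(5)
n <- 150

# Synthetic three-class data with two numeric predictors (invented values).
toy <- data.frame(
  class = factor(rep(c("a", "b", "c"), each = n)),
  x1 = c(rnorm(n, 0), rnorm(n, 3), rnorm(n, 6)),
  x2 = c(rnorm(n, 0), rnorm(n, 3), rnorm(n, 0))
)

fit <- lda(class ~ x1 + x2, data = toy)

# Coefficients of linear discriminants: one row per predictor,
# one column per LD.
print(fit$scaling)

# predict() computes the discriminant scores from the coefficients and
# returns the predicted class and the posterior class probabilities.
pred <- predict(fit, newdata = toy)
print(head(round(pred$posterior, 3)))

accuracy <- mean(pred$class == toy$class)
print(accuracy)
```

With K classes and p predictors, LDA yields at most min(p, K - 1) discriminants, which is why this toy fit has two LD columns.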
Conclusion
Linear discriminant analysis assumes that the predictors are Gaussian within each class and that all classes share the same covariance matrix. This model has been shown to be well suited to multi-class analysis and, like PCA, can be used as a dimensionality reduction technique. Despite this, when the amount of data for each arrest type is imbalanced in the training set, the model may be unable to accurately classify the observations in the test set. We expect this accuracy problem with our model, because there are far fewer violent crimes than misdemeanors such as narcotics offenses. Also, LDA is limited to linear combinations of the predictors. Higher-order interactions that may exist between the predictors may therefore not be captured accurately by this model.
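The effect of class imbalance can be made concrete with a small simulation. The sketch below uses invented data with one "common" and one "rare" class whose distributions overlap. It shows why overall accuracy alone can be misleading, by also computing the recall for each class.

```r
library(MASS)  # provides lda()

set.seed(6)

# Imbalanced synthetic data: 950 "common" and 50 "rare" observations
# drawn from overlapping distributions (invented values).
toy <- data.frame(
  class = factor(c(rep("common", 950), rep("rare", 50))),
  x = c(rnorm(950, mean = 0), rnorm(50, mean = 1))
)

fit  <- lda(class ~ x, data = toy)
pred <- predict(fit, newdata = toy)$class

# Confusion matrix, per-class recall, and overall accuracy.
conf    <- table(actual = toy$class, predicted = pred)
recall  <- diag(conf) / rowSums(conf)
overall <- mean(pred == toy$class)

print(conf)
print(round(recall, 3))
print(round(overall, 3))
```

By default lda() uses the class proportions as priors, so a heavily outnumbered class is rarely predicted unless its predictors are well separated from those of the majority class.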
In conclusion, this data set contains a wide variety of arrest types. Based on the findings in this report, we do not believe that the DC police are able to predict the profile of a person who is likely to commit a crime in DC. There are two reasons for this. First, based on the BIC, the model needs to include features such as the arrest district and a person’s own police district. When police profile a potential criminal, however, they often base their arrests on physical descriptions such as race and approximate age. As a result, our model could not actually be employed by officers on the street. Second, in agreement with the literature, “community policing” often involves some type of bias. To demonstrate the ability of a police officer to predict the crime a person may have committed, the model would have to include some type of bias variable. Based on this data set, community policing should not be used.